class: center, middle, inverse, title-slide # Lecture 17 ## Correlation and Linear Regression ### Psych 10 C ### University of California, Irvine ### 05/09/2022 --- ## Correlation - Up to this point we have looked at problems with a single dependent variable; however, we could have more than one. -- - When we have two dependent variables, we can ask ourselves: how are those two variables associated? -- - The correlation between two variables is a measure of the association between them. -- - Formally, it's a measure of how the two variables change together. For example, what happens to the values of one variable as the second one increases? -- - The correlation coefficient measures this degree of association, and we denote this value with `\(R\)`. --- ## Correlation - The correlation coefficient can take positive values, negative values, or it can be 0. -- - However, it will always be bounded between -1 and 1, for example: -- .pull-left[ <img src="data:image/png;base64,#lec-17_files/figure-html/cor-1-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="data:image/png;base64,#lec-17_files/figure-html/cor-2-1.png" style="display: block; margin: auto;" /> ] --- ## Correlation - A correlation of 0 should look like this: <img src="data:image/png;base64,#lec-17_files/figure-html/cor-0-1.png" style="display: block; margin: auto;" /> --- ## Correlation - Therefore, the sign of the correlation indicates whether one variable **increases** as the other one **increases** (**positive R**) or whether one variable **decreases** as the other one **increases** (**negative R**). -- - On the other hand, the magnitude of R indicates the "strength" of the association, with values closer to 1 (or -1) representing a stronger association between the variables. 
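--

- The correlation coefficient can be computed directly from this idea of variables changing together. A minimal sketch (the plots in this deck were produced in R, but the arithmetic is the same in any language; the function name `pearson_r` is just for illustration):

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Sample correlation coefficient R between two variables."""
    mx, my = mean(x), mean(y)
    # How the two variables change together, relative to their means...
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # ...scaled so that the result is always bounded between -1 and 1.
    den = sqrt(sum((xi - mx) ** 2 for xi in x)) * sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

# A perfectly linear increasing relationship gives R = 1;
# a perfectly linear decreasing relationship gives R = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```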
-- .pull-left[ <img src="data:image/png;base64,#lec-17_files/figure-html/cor-08-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="data:image/png;base64,#lec-17_files/figure-html/cor-02-1.png" style="display: block; margin: auto;" /> ] --- ## Correlation - An example of a negative correlation would look like: .pull-left[ <img src="data:image/png;base64,#lec-17_files/figure-html/cor-n08-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="data:image/png;base64,#lec-17_files/figure-html/cor-n02-1.png" style="display: block; margin: auto;" /> ] --- ## Correlation - Regardless of the sign of a correlation, we say that a correlation is stronger the closer it is to 1 (or -1). -- - For example, a correlation of 0.5 implies a stronger association between the variables than a correlation of 0.2. -- - A correlation of `\(-0.3\)` also implies a stronger association between the variables than a correlation of `\(0.2\)`. -- - Correlation measures the association between two variables; however, in this class we are not interested in associations but in using one variable to **predict** the values of another. -- - For example, when we have a grouping variable like the year when a student started college, we would like to use it to predict (and to learn whether it affects) some other variable like their anxiety levels. -- - Linear regression builds on the concept of a correlation and allows us to make **predictions** about the values of a dependent variable using the values of an independent variable. --- ## Linear Regression - To this point we have been using categorical variables in order to make predictions about data. However, what should we do if we have a continuous independent variable, or a variable that can take many different values like income or IQ in the previous graphs? -- - For example, say that we want to predict a student's grade based on the number of classes that they missed during a quarter. 
-- What is the independent variable in this example? .can-edit.key-likes[ - **ANS:** ] -- What is the dependent variable in this example? .can-edit.key-likes[ - **ANS:** ] -- - How can we make use of the information from our independent variable to predict the values of our dependent variable? --- ## Linear Regression - Let's look at some data first; we have the number of classes missed by 4 students and their corresponding grades in a statistics class: -- .pull-left[ <img src="data:image/png;base64,#lec-17_files/figure-html/miss-grade-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="data:image/png;base64,#lec-17_files/figure-html/miss-grade-lm-1.png" style="display: block; margin: auto;" /> ] --- ## Linear Regression - We have used averages before to predict the values of our observations in an experiment (averages by group or by combinations of factor levels). The mean was our best prediction for the expected value of a dependent variable. -- - However, when we have a continuous independent variable, or a numeric variable that can take many values, we can use **lines** instead of averages. -- - In other words, we will use the equation of a line to predict the values of our dependent variable using the value of our independent variable. -- - The equation for the line is: `$$\beta_0 + \beta_1 x_i$$` -- - Where `\(x_i\)` represents the value of our independent variable associated with our `\(i\)`-th observation. This variable is also known as the **predictor**. -- - In our previous example, `\(x_1 = 2\)` means that the first student in the sample missed 2 classes, and `\(x_2 = 8\)` means that the second student in our sample missed 8 classes. --- ## Parameters of the equation of a line - There are 2 parameters in the equation of a line: -- 1. `\(\beta_0\)`: which is known as **the intercept** and represents our prediction when the independent variable is set to 0 (prediction when `\(x = 0\)`). -- 2. 
`\(\beta_1\)`: which is known as **the slope** and represents how much our dependent variable changes when the value of our independent variable `\(x\)` increases by **one** unit. -- - In our grades and missed classes example, **the intercept** `\(\beta_0\)` would be interpreted as the predicted grade of a student that didn't miss a single class. -- - On the other hand, **the slope** `\(\beta_1\)` would be interpreted as the number of points that a student is expected to lose (because the line is going down) for each missed class. -- - How can we get the predictions for different values of `\(x\)`? --- ## Predictions of a linear model - Say that when a student misses 0 classes they are expected to get 98 points on average. -- - This means that we can set `\(\beta_0 = 98\)` in our line equation. -- - Additionally, assume that students are expected to lose 5 points on average for each class they miss during the quarter. -- - This means that `\(\beta_1 = -5\)` points in our line equation. -- - What would be the average predicted number of points for students that missed 2, 4, 5 and 8 classes during the quarter? Remember that the equation of the line is `\(\beta_0 + \beta_1 x_i\)` -- | Classes missed | Average predicted grade | |----------------|:-----------------------:| | 2 | .can-edit[] | | 4 | .can-edit[] | | 5 | .can-edit[] | | 8 | .can-edit[] | --- ## Simple linear regression - When we only have one **predictor** (just one `\(x\)`) we refer to this model as simple linear regression. -- - As we will see later in the course, we can have more than one predictor in our models. When we use more than one **predictor** (more than one `\(x\)`) we call it Multiple Linear Regression. -- - As with other models in the class, the key step will be to find the predictions of the model for each observation in order to calculate the SSE of the model. -- - Like we did in the example, we will get our predictions by using the equation of the line. 
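--

- Getting a prediction is just plugging a value of `\(x\)` into the equation of the line. A minimal sketch, using the parameters from the earlier example (`\(\beta_0 = 98\)`, `\(\beta_1 = -5\)`; the function name `predict` is just for illustration):

```python
def predict(x, b0=98, b1=-5):
    """Prediction from the equation of the line: b0 + b1 * x."""
    return b0 + b1 * x

# Predicted grade for each number of missed classes in the table.
for missed in [2, 4, 5, 8]:
    print(missed, predict(missed))
# 2 -> 88, 4 -> 78, 5 -> 73, 8 -> 58
```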
-- - However, the problem is: how do we find the best **intercept** `\((\beta_0)\)` and **slope** `\((\beta_1)\)`? -- - The short answer is that we have to ask a computer... -- - However, I believe that it's always better to try to understand the intuition behind how we find those values. To do that we need to use math. --- ## Parameters in a linear regression - First, let's think about what it means to have different values for the **intercept** `\((\beta_0)\)` and **slope** `\((\beta_1)\)`. -- - Each combination of those values will give us a different line. -- - However, regardless of the line, our data is always fixed. The points in our graph are just the grades of the students, and those grades don't change because we use different lines to predict them. -- .pull-left[ <img src="data:image/png;base64,#lec-17_files/figure-html/miss-grade-lm1-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="data:image/png;base64,#lec-17_files/figure-html/miss-grade-lm2-1.png" style="display: block; margin: auto;" /> ] --- ## Parameters in a linear regression - As we saw in the previous graphs, each line will have a different distance (black arrows) between prediction (purple line) and observation (blue points). -- - These distances are the same as in other models: they are the difference between prediction and observation, or what we call **error**. -- - When we square those values and add them all together we get the Sum of Squared Errors. -- - This means that each line (each different combination of the values of `\(\beta_0\)` and `\(\beta_1\)`) that we can draw on our graph will have a different SSE. -- - Therefore, we can look for the line that has the smallest overall squared distance from all the points, or the smallest **SSE**. -- - In other words, we want to find the values of `\(\beta_0\)` and `\(\beta_1\)` that minimize the Sum of Squared Errors. 
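--

- We can see this directly by computing the SSE of two candidate lines. A minimal sketch: the `\(x\)` values 2 and 8 come from the lecture example, but the grades below are made up for illustration.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared errors of the line b0 + b1*x over the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical classes-missed / grade data (grades invented for this sketch).
xs = [2, 8, 4, 5]
ys = [88, 58, 78, 73]

# Two different lines give two different SSEs over the same fixed data:
print(sse(98, -5, xs, ys))  # 0: this line passes through every point
print(sse(90, -3, xs, ys))  # 84: a different line, a larger SSE
```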
--- ## Estimators for linear models - This is not only the case for simple linear regression, but for every linear model. -- - When we had two groups and we used the average of all participants in the experiment as the prediction for the Null model, this was because that average minimizes the SSE of a model that assumes that all groups follow the same Normal distribution. -- - When we used the averages of the groups, it was because the two averages minimize the distance (SSE) between prediction and observation for a model that assumes that there are two different groups. -- - The **estimators** that we have used for each model have all been found using the same idea. -- - This is the general intuition behind the **estimators** used in any linear model (two groups, multiple groups, factorial designs and linear regression). --- ## Estimators for linear models - When we formalize our models with some parameters (like `\(\beta_0\)` and `\(\beta_1\)` in the simple linear regression model) our objective will always be to find the values of those parameters that minimize the Sum of Squared Errors. This method is known as Least Squares. -- - Every **estimator** that we have seen in the class has been obtained this way. -- - Because of this, they are known as **Least Squares Estimators**. -- - This is the name that you might find in an introductory statistics textbook, and it means that the **estimator** was found by looking for the value of a parameter in a model that makes the SSE as small as possible. -- - However, this does not mean that the SSE we obtain is the smallest among all possible models, just that it will be the smallest value for that specific model. -- - For example, almost all our Null models use the average response of all participants because, when we only have one parameter, the average minimizes the SSE.
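---

## Estimators for linear models

- For simple linear regression, the values of `\(\beta_0\)` and `\(\beta_1\)` that minimize the SSE have a known closed form, so the computer does not literally try every line. A minimal sketch of those least squares estimators, on made-up classes-missed/grade data:

```python
from statistics import mean

def least_squares(xs, ys):
    """Closed-form least squares estimators for simple linear regression."""
    mx, my = mean(xs), mean(ys)
    # Slope: how x and y vary together, divided by how much x varies.
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    # Intercept: forces the fitted line through the point of means (mx, my).
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical data lying exactly on the line 98 - 5x (invented for this sketch):
b0, b1 = least_squares([2, 8, 4, 5], [88, 58, 78, 73])
print(b0, b1)  # recovers intercept 98 and slope -5
```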